:orphan:

Core Basics 1: Train, Evaluate and Deploy a Classifier
======================================================

In this lesson we will learn how to train, evaluate and deploy classifiers with Khiops. Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops and defining some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess
    from khiops import core as kh

    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")

    # If there are any issues, you may print Khiops status with the following command:
    # kh.get_runner().print_status()

Training a Classifier
---------------------

We’ll train a classifier for the ``Iris`` dataset. This is a classical dataset containing the data of different plants belonging to the genus *Iris*. It contains 150 records, 50 for each of three variants of *Iris*: *Setosa*, *Virginica* and *Versicolor*. The record for each sample contains the length and width of its petal and sepal. The standard task for this dataset is to construct a classifier for the type of *Iris*, taking the length and width characteristics as inputs.

To train a classifier with Khiops, we use two types of files:

- A plain-text delimited data file (for example a ``csv`` file)
- A *dictionary* file which describes the schema of the above data table (``.kdic`` file extension)

Let’s save the locations of these files for the ``Iris`` dataset into variables and then take a look at their contents:

.. code:: ipython3

    iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
    iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")

    print(f"Iris dictionary file: {iris_kdic}")
    peek(iris_kdic)

    print(f"Iris data file: {iris_data_file}\n")
    peek(iris_data_file)

.. parsed-literal::

    Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic

    Dictionary Iris
    {
        Numerical SepalLength ;
        Numerical SepalWidth ;
        Numerical PetalLength ;
        Numerical PetalWidth ;
        Categorical Class ;
    };

    Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt

    SepalLength SepalWidth PetalLength PetalWidth Class
    5.1 3.5 1.4 0.2 Iris-setosa
    4.9 3.0 1.4 0.2 Iris-setosa
    4.7 3.2 1.3 0.2 Iris-setosa
    4.6 3.1 1.5 0.2 Iris-setosa
    5.0 3.6 1.4 0.2 Iris-setosa
    5.4 3.9 1.7 0.4 Iris-setosa
    4.6 3.4 1.4 0.3 Iris-setosa
    5.0 3.4 1.5 0.2 Iris-setosa
    4.4 2.9 1.4 0.2 Iris-setosa

Note that the *Iris* variant information is in the column ``Class``. Now let’s specify the path to the analysis report file.

.. code:: ipython3

    analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")

    print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")

.. parsed-literal::

    Iris analysis report file path: exercises/Iris/AnalysisReport.khj

We are now ready to train the classifier with the Khiops function ``train_predictor``. This function returns a tuple containing the locations of two files:

- the modeling report (``AnalysisReport.khj``): A JSON file containing information such as the informativeness of each variable, the variables selected for the model and performance metrics. It is saved at the path stored in the ``analysis_report_file_path_Iris`` variable that we just defined.
- the model’s *dictionary* file (``AnalysisReport.model.kdic``): This file is an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data.

.. code:: ipython3

    iris_report, iris_model_kdic = kh.train_predictor(
        iris_kdic,
        dictionary_name="Iris",
        data_table_path=iris_data_file,
        target_variable="Class",
        analysis_report_file_path=analysis_report_file_path_Iris,
        max_trees=0,  # by default Khiops constructs 10 decision tree variables
    )
    print(f"Iris report file: {iris_report}")
    print(f"Iris modeling dictionary: {iris_model_kdic}")

.. parsed-literal::

    Iris report file: exercises/Iris/AnalysisReport.khj
    Iris modeling dictionary: exercises/Iris/AnalysisReport.model.kdic

Note that ``iris_report`` (the first element of the tuple returned by ``train_predictor``) is identical to ``analysis_report_file_path_Iris``.

In the next sections, we’ll use the file at ``iris_report`` to assess the model’s performance and the file at ``iris_model_kdic`` to deploy it. Now we can have a look at the report with the Khiops Visualization app:

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(iris_report)

Exercise
~~~~~~~~

We’ll repeat the previous steps on the ``Adult`` dataset. This dataset contains characteristics of the adult population in the USA, such as age, gender and education. Its task is to predict the variable ``class``, which indicates whether the individual earns ``more`` or ``less`` than 50,000 dollars.

Let’s start by saving the paths for the ``Adult`` dataset into variables:

.. code:: ipython3

    adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
    adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")

Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    print(f"Adult dictionary file: {adult_kdic}")
    peek(adult_kdic)

    print(f"Adult data file: {adult_data_file}\n")
    peek(adult_data_file)

.. parsed-literal::

    Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic

    Dictionary Adult
    {
        Categorical Label ;
        Numerical age ;
        Categorical workclass ;
        Numerical fnlwgt ;
        Categorical education ;
        Numerical education_num ;
        Categorical marital_status ;

    Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt

    Label age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country class
    1 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States less
    2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States less
    3 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States less
    4 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States less
    5 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba less
    6 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States less
    7 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica less
    8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States more
    9 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States more

We now specify the path to the analysis report file for this exercise:

.. code:: ipython3

    analysis_report_file_path_Adult = os.path.join(
        "exercises", "Adult", "AnalysisReport.khj"
    )

    print(f"Adult analysis report file path: {analysis_report_file_path_Adult}")

.. parsed-literal::

    Adult analysis report file path: exercises/Adult/AnalysisReport.khj

Train a classifier for the ``Adult`` database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the name of the target variable is ``class`` (**in lower case!**). Do not forget to set ``max_trees=0``. Save the resulting file locations into the variables ``adult_report`` and ``adult_model_kdic`` and print them.

.. code:: ipython3

    adult_report, adult_model_kdic = kh.train_predictor(
        adult_kdic,
        dictionary_name="Adult",
        data_table_path=adult_data_file,
        target_variable="class",
        analysis_report_file_path=analysis_report_file_path_Adult,
        max_trees=0,
    )
    print(f"Adult report file: {adult_report}")
    print(f"Adult modeling dictionary file: {adult_model_kdic}")

.. parsed-literal::

    Adult report file: exercises/Adult/AnalysisReport.khj
    Adult modeling dictionary file: exercises/Adult/AnalysisReport.model.kdic

Inspect the results with the Khiops Visualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(adult_report)

Accessing a Classifier’s Basic Evaluation Metrics
-------------------------------------------------

We access the classifier’s evaluation metrics by loading the file at ``iris_report`` with the Khiops function ``read_analysis_results_file``:

.. code:: ipython3

    iris_results = kh.read_analysis_results_file(iris_report)
    print(type(iris_results))

.. parsed-literal::

    <class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the ``AnalysisResults`` class. The model evaluation reports are stored in its ``train_evaluation_report`` and ``test_evaluation_report`` attributes, which are of class ``EvaluationReport``.

.. code:: ipython3

    iris_train_eval = iris_results.train_evaluation_report
    iris_test_eval = iris_results.test_evaluation_report
    print(type(iris_train_eval))
    print(type(iris_test_eval))

.. parsed-literal::

    <class 'khiops.core.analysis_results.EvaluationReport'>
    <class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the ``get_snb_performance`` method of the evaluation report objects:

.. code:: ipython3

    iris_train_performance = iris_train_eval.get_snb_performance()
    iris_test_performance = iris_test_eval.get_snb_performance()

These objects are of class ``PredictorPerformance``. They expose the ``accuracy`` and ``auc`` attributes:

.. code:: ipython3

    print(f"Iris train accuracy: {iris_train_performance.accuracy}")
    print(f"Iris test accuracy: {iris_test_performance.accuracy}")
    print("")
    print(f"Iris train AUC: {iris_train_performance.auc}")
    print(f"Iris test AUC: {iris_test_performance.auc}")

.. parsed-literal::

    Iris train accuracy: 0.980952
    Iris test accuracy: 0.955556

    Iris train AUC: 0.998134
    Iris test AUC: 0.984362

Exercise
~~~~~~~~

Read the contents of the file at ``adult_report`` for the Adult analysis and print its type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_results = kh.read_analysis_results_file(adult_report)
    type(adult_results)

.. parsed-literal::

    khiops.core.analysis_results.AnalysisResults

Save the evaluation reports of the ``Adult`` classification to the variables ``adult_train_eval`` and ``adult_test_eval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_eval = adult_results.train_evaluation_report
    adult_test_eval = adult_results.test_evaluation_report

Show the model’s train and test accuracies and AUCs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_performance = adult_train_eval.get_snb_performance()
    adult_test_performance = adult_test_eval.get_snb_performance()

    print(f"Adult train accuracy: {adult_train_performance.accuracy}")
    print(f"Adult test accuracy: {adult_test_performance.accuracy}")
    print("")
    print(f"Adult train AUC: {adult_train_performance.auc}")
    print(f"Adult test AUC: {adult_test_performance.auc}")

.. parsed-literal::

    Adult train accuracy: 0.86947
    Adult test accuracy: 0.86592

    Adult train AUC: 0.926153
    Adult test AUC: 0.921511

Deploying a Classifier
----------------------

We are going to deploy the ``Iris`` classifier we have just trained, applying it to the same dataset (normally we would do this on new data). We saved the model in the file at ``iris_model_kdic``. This file is usually large and hard to read, so you should know what you are doing before editing it. Let’s take a quick look at its contents:

.. code:: ipython3

    peek(iris_model_kdic, 25)

.. parsed-literal::

    #Khiops 11.0.0-b.0

    Dictionary SNB_Iris
    {
        Unused Numerical SepalLength ;
        Unused Numerical SepalWidth ;
        Unused Numerical PetalLength ;
        Unused Numerical PetalWidth ;
        Unused Categorical Class ;
        Unused Structure(DataGrid) VClass = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35)) ;
        Unused Structure(DataGrid) PPetalLength = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)) ; // DataGrid(PetalLength, Class)
        Unused Structure(DataGrid) PPetalWidth = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33)) ; // DataGrid(PetalWidth, Class)
        Unused Structure(Classifier) SNBClass = SNBClassifier(Vector(0.453125, 0.5), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass) ;
        Categorical PredictedClass = TargetValue(SNBClass) ;
        Unused Numerical ScoreClass = TargetProb(SNBClass) ;
        Numerical `ProbClassIris-setosa` = TargetProbAt(SNBClass, "Iris-setosa") ;
        Numerical `ProbClassIris-versicolor` = TargetProbAt(SNBClass, "Iris-versicolor") ;
        Numerical `ProbClassIris-virginica` = TargetProbAt(SNBClass, "Iris-virginica") ;
    };

Note that the modeling dictionary contains 4 used variables:

- ``PredictedClass``: The class with the highest probability according to the model
- ``ProbClassIris-setosa``, ``ProbClassIris-versicolor``, ``ProbClassIris-virginica``: The probabilities of each class according to the model

These will be the columns of the table obtained after deploying the model. This table will be saved at ``iris_deployment_file``.

.. code:: ipython3

    iris_deployment_file = os.path.join("exercises", "Iris", "iris_deployment.txt")
    kh.deploy_model(
        iris_model_kdic,
        dictionary_name="SNB_Iris",
        data_table_path=iris_data_file,
        output_data_table_path=iris_deployment_file,
    )

    peek(iris_deployment_file)

.. parsed-literal::

    PredictedClass ProbClassIris-setosa ProbClassIris-versicolor ProbClassIris-virginica
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879
    Iris-setosa 0.9935139877 0.004559173379 0.001926838879

Exercise
~~~~~~~~

Use the ``deploy_model`` function to deploy the model stored in the file at ``adult_model_kdic``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which columns are deployed?

.. code:: ipython3

    adult_deployment_file = os.path.join("exercises", "Adult", "adult_deployment.txt")
    kh.deploy_model(
        adult_model_kdic,
        dictionary_name="SNB_Adult",
        data_table_path=adult_data_file,
        output_data_table_path=adult_deployment_file,
    )

    peek(adult_deployment_file)

.. parsed-literal::

    Predictedclass Probclassless Probclassmore
    less 0.9999926806 7.319380182e-06
    more 0.4107568382 0.5892431618
    less 0.9622314248 0.03776857516
    less 0.9172269213 0.08277307874
    less 0.5833340928 0.4166659072
    more 0.2619499457 0.7380500543
    less 0.9940101932 0.005989806772
    more 0.4199564537 0.5800435463
    more 0.001247535351 0.9987524646
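
As a quick sanity check on a deployed table, the minimal sketch below verifies that the predicted class of each row is the class whose probability column holds the largest value. It does not call Khiops: it parses a few rows copied from the Adult deployment output above. It assumes that the deployed file is tab-separated with a header line and that its columns follow the ``Predicted<target>`` / ``Prob<target><value>`` naming seen in the outputs above. The names ``check_deployed_rows`` and ``SAMPLE_DEPLOYMENT`` are hypothetical helpers introduced for this illustration and are not part of the Khiops API.

```python
import csv
import io

# Sample rows copied from the Adult deployment output shown above.
# Assumption: the deployed table is tab-separated with a header line.
SAMPLE_DEPLOYMENT = (
    "Predictedclass\tProbclassless\tProbclassmore\n"
    "less\t0.9999926806\t7.319380182e-06\n"
    "more\t0.4107568382\t0.5892431618\n"
    "less\t0.9622314248\t0.03776857516\n"
    "more\t0.2619499457\t0.7380500543\n"
)


def check_deployed_rows(lines, target="class"):
    """Return (row_count, mismatch_count) for a deployed table.

    A row is a mismatch when its predicted class is not the class whose
    probability column holds the largest value.
    """
    predicted_column = f"Predicted{target}"
    prob_prefix = f"Prob{target}"
    reader = csv.DictReader(lines, delimiter="\t")
    row_count = 0
    mismatch_count = 0
    for row in reader:
        # Map each class value to its probability, e.g. {"less": 0.99, "more": 0.01}
        probs = {
            name[len(prob_prefix):]: float(value)
            for name, value in row.items()
            if name.startswith(prob_prefix)
        }
        most_likely = max(probs, key=probs.get)
        row_count += 1
        if row[predicted_column] != most_likely:
            mismatch_count += 1
    return row_count, mismatch_count


rows, mismatches = check_deployed_rows(io.StringIO(SAMPLE_DEPLOYMENT))
print(f"{rows} rows checked, {mismatches} mismatches")
```

To run the same check on a real deployment, pass an open file object instead of the sample, for instance the file at ``adult_deployment_file`` opened with ``encoding="utf8"``. For the ``Iris`` deployment, pass ``target="Class"`` so that the column names match.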